biostatfandomcom-20200213-history
R programming
Post to Twitter Using R for stats Tools for editing R scripts Some good tools for using R R commander Pretty good tools to start using R and uses its basic methods. Tinn-R editor R studio Very powerful IDE based on the R language. 'Tips about R' *Do not use 'for' loops in R. They will take ages if you make too many iterations. Instead use the '*apply' methods (sapply, lapply, tapply, apply). lapply itterates along a list, sapply along a vector; apply and tapply have extra parameters to indicate which part of the array or vector to use. For apply it is the dimension of the array over which to iterate. Ex: apply(mymat,2,sum,na.rm=TRUE) will produce the sum of each column of the matrix, after missing values have been removed. tapply selects parts of a vector using factors tapply(myvector,myfactor,mean) will calculate the mean of values of myvector after splitting them into groups based on the different values of myfactor. As illustration, here is some code using sapply and apply testFPKM<-apply(mydat2:10,2:13,1,function(x{y1<-xc(1:4,11,12);y2<-xc(5:10);t.test(y1,y2)}) rawp0<-sapply(mydat,function(x) print(x$p.value)) *If you plot a very''' large number of points', it may be interesting to use the smoothing function scatter.smooth. This will speed up the plotting process and give you a density plot which may be very informative. *ifelse is very useful when dealing with vector. Ex: z<- ifelse(x,y/x,y/2) This gives zi <- yi/xi or zi <- yi/2 *To get several subplots (1 line of 2 subplots in this case), use par(mfrow=c(1,2)) *It is well known that maximum likelihood estimate in general tend to underestimate variance parameters because they fail to adjust for the fact that the mean is estimated from the same data. This will not have much impact if the number of samples is large. Data Manipulation http://www.stats.ox.ac.uk/~ruth/RCourse/Rcourse3.pdf Stuff to read table, play with matrices or vectors, get to some elements or parts of the matrix, create function, merge data frames. For matrix multiplication, use %*%. Inverses of square matrices can be found using the '''solve' command. eigen computes eigenvalues and eigenvectors of matrices. colSums, colMeans returns the column sums or means of a matrix. rowSums and rowMeans the sums or means of the rows. Can also be used on arrays http://cran.r-project.org/doc/contrib/usingR.pdf is a great tutorial to get started with the basics. Among other things, you will see what is the previous reference + how to plot, how to deal with NA elements in data frames, how to make simple plots, add lines to them, scatterplots (and its smooth version), histograms, boxplots, rugplots (see example below) . Other info is also listed in 'Data visualization ' 'Linear models' http://cran.r-project.org/doc/contrib/usingR.pdf Simple linear regression. You will find out what is behind regression objects in R, how to make model formula in R, multiple linear regression, polynomial and spline regression. Simple linear models lm(formula = loss ~ hard + tens, data = Rubber) To plot the results of the regression, use termplot(Rubber.lm, partial=TRUE, smooth=panel.smooth) You'll get something like this What order of polynomial? > seedrates.lm1<-lm(grain~rate,data=seedrates) > seedrates.lm2<-lm(grain~rate+I(rate^2),data=seedrates) > anova(seedrates.lm2,seedrates.lm1) Model 1: grain ~ rate + I(rate^2) Model 2: grain ~ rate Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F) 1 2 0.026286 2 3 0.187000 -1 -0.160714 12.228 0.07294 The F-value is large, but on this evidence there are too few degrees of freedom to make a totally convincing case for preferring a quadratic to a line. Comparing populations Normal distribution Classical cases: few samples, few variables You're a lucky guy ... your distro follows a normal distribution. If populations are normally distributed and the variances are equal we can use the t-test with the test statistic: t = \frac{\overline{y}_1 - \overline{y}_2 }{s \sqrt{1/n_1+1/n_2}} \tilde t_{n_1+n_2-2}, where s'' is the Standard deviation of the sample and ''n is the sample size. The degrees of freedom used in this test is n'' − 1. If the variances are not equal, it is preferable to use the test statistic (Welch test): t = \frac{\overline{y}_1 - \overline{y}_2 }{\sqrt{s_1^2/n_1+s_2^2/n_2}} \tilde t_{\nu}, If the data are paired (still normally distributed), we have the paired test statistic: \frac{\overline{d}}{s \sqrt{1/n}} \tilde t_{n-1} '''Wilcoxon Mann-Whitney rank sum test' applies the t-statistic to the joint ranks of all measurements in both groups instead of the original measurements. The Wilcoxon signed-rank statistic is based on the ranks of the absolute differences |di|. The statistic is defined as the sum of the ranks associated with positive difference di > 0. It should be noted that this test is only valid when the differences di are symmetrically distributed. An interesting tool for multiple testing allowing to use either of these methods is "multtest". After testing with mt.teststat(yourData,classlabel=yourClassLabels) You can plot the test statistics using qqplot to see whether your distribution follow a normal or to see variables with "unusual" test statistics. "multtest" also provides tools for multiple testing: The procedures include the Bonferroni, Holm (1979), Hochberg (1988), and Sidak procedures for strong control of the family–wise Type I error rate (FWER) (false negative), and the Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001) procedures for (strong) control of the false discovery rate (FDR, false positive). Adjusted p–values for these seven multiple testing procedures can be computed as follows and stored in the original variable order in adjp using order(res$index): procs <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD", "BH", "BY") res <- mt.rawp2adjp(rawp0, procs) adjp <- res$adjporder(res$index), where rawp0 are your p-values considering the null hypothesis. If you don't like these procedures and have some time to spend calculating permutations you can use mt.maxT and mt.minP. The mt.maxT and mt.minP functions compute permutation adjusted p–values for the maxT and minP step–down multiple testing procedure described in Westfall and Young (1993). There are thus in general less conservative than the standard Bonferroni procedure. This means that it allows more easily positive event reducing the false negative rate but potentially increasing the false positive rate. The permutation algorithm for the maxT and minP procedures is described in Ge et al. (2003). resT <- mt.maxT(yourData) More complex case : few samples, many variables If the distribution follows approximately a normal distribution and you're dealing with discrete and skewed data (for example when working with count data like in differential gene expression), you can use a test based on a negative binomial distribution. The advantage of this distribution over the Poisson distribution is that it allows to model biological variance correctly. In clear, if you plot the log2fold against the mean of your variable, you may see that the spead of the points is not equal from one variable to another (see plot below taken from the vignette of DESeq). In this case the Poisson distribution will not be able to model the variation in the variances between the variables. DESeq is a great tool to model almost any type of variance. EdgeR is also a great tool but will only model linear regression in the variance. In short, use DESeq or EdgeR when you deal with many variables, only a few samples and you expect the variance or dispersion to vary along the variables (ex: variance for gene 1 is different from variance for gene 2, etc). These packages are based on the so-called quantile-adjusted conditional maximum likelihood (qCML) which gives smaller bias than the maximum likelihood (ML). If the number of samples increases a conditional maximum likelihood (CML) estimate will be as good as the qCML. Not normal distribution Fisher-exact test can be used if you work with contingency tables where sample sizes are small. Use fisher.test in R. With large samples, a chi-square test can be used in this situation. However, the significance value it provides is only an approximation, because the sampling distribution of the test statistic that is calculated is only approximately equal to the theoretical chi-squared distribution. Multiple test correction Suppose genes are our variables *'Bonferroni' Corrected P-value= p-value * n (number of genes in test) <0.05 The Bonferroni correction is the most stringent test of all, but offers the most conservative approach to control for false positives. *'Bonferroni Step-down (Holm) correction' #The p-value of each gene is ranked from the smallest to the largest. #The first p-value is multiplied by the number of genes present in the gene list: if the end value is less than 0.05, the gene is significant: Corrected P-value= p-value * n < 0.05 #The second p-value is multiplied by the number of genes - 1: Corrected P-value= p-value * n-1 < 0.05 #The third p-value is multiplied by the number of genes less 2: Corrected P-value= p-value * n-2 < 0.05 It follows that sequence until no gene is found to be significant. *'Westfall and Young Permutation' Both Bonferroni and Holm methods are called single-step procedures, where each p- value is corrected independently. The Westfall and Young permutation method takes advantage of the dependence structure between genes, by permuting all the genes at the same time. The Westfall and Young Permutation is a correction accounting for genes coregulation. However, it is very slow and is also very conservative. The Westfall and Young permutation follows a step-down procedure similar to the Holm method, combined with a bootstrapping method to compute the p-value distribution: #P-values are calculated for each gene based on the original data set and ranked. #The permutation method creates a pseudo-data set by dividing the data into artificial treatment and control groups. #P-values for all genes are computed on the pseudo-data set. #The successive minima of the new p-values are retained and compared to the original ones. #This process is repeated a large number of times, and the proportion of resampled data sets where the minimum pseudo-p-value is less than the original p-value is the adjusted p-value. Because of the permutations, the method is very slow. The Westfall and Young permutation method has a similar Family-wise error rate as the Bonferroni and Holm corrections. *'Benjamini and Hochberg False Discovery Rate' This correction is the least stringent than Bonferroni or Bonferroni Step-down (Holm) '''or '''Westfall and Young Permutation, and therefore tolerates more false positives. There will be also less false negative genes. It provides a good balance between discovery of statistically significant genes and limitation of false positive occurrences. Benjamini and Hochberg False Discovery Rate 'does not take into account gene corregulation. Here is how it works: #The p-values of each gene are ranked from the smallest to the largest. #The largest p-value remains as it is. #The second largest p-value is multiplied by the total number of genes in gene list divided by its rank. If less than 0.05, it is significant. Corrected p-value = p-value*(n/n-1) < 0.05, if so, gene is significant. #The third p-value is multiplied as in step 3: Corrected p-value = p-value*(n/n-2) < 0.05, if so, gene is significant. And so on. *'Benjamini–Hochberg–Yekutieli This correction does take into account that the tests are dependent. It is more conservative than the Benjamini-Hochberg multiple test correction. #The p-values of each gene are ranked from the smallest to the largest. #The largest p-value remains as it is. #The second largest p-value is multiplied by the total number of genes in gene list divided by its rank but corrected. If less than 0.05, it is significant. Corrected p-value = p-value*(n/((n-1)* \sum_{i=1}^{n-1}1/i ) < 0.05, if so, gene is significant. #The third is multiplied by the total number of genes in gene list divided by its rank but corrected. If less than 0.05, it is significant. Corrected p-value = p-value*(n/((n-2)* \sum_{i=1}^{n-2}1/i ) < 0.05, if so, gene is significant. And so on. R in real-life applications Drew Conway's analysis of the Afghanistan attacks data released by Wikileaks. Analyzing attack data may help localize enemy troops.